Is GitHub Copilot a blessing, or a curse?

2021-07-204:08170193www.fast.ai

Making neural nets uncool again

Show article

Written: 19 Jul 2021 by Jeremy Howard

GitHub Copilot is a new service from GitHub and OpenAI, described as “Your AI pair programmer”. It is a plugin to Visual Studio Code which auto-generates code for you based on the contents of the current file, and your current cursor location.

It really feels quite magical to use. For example, here I’ve typed the name and docstring of a function which should “Write text to file fname”:

The grey body of the function has been entirely written for me by Copilot! I just hit Tab on my keyboard, and the suggestion gets accepted and inserted into my code.

This is certainly not the first “AI powered” program synthesis tool. GitHub’s Natural Language Semantic Code Search in 2018 demonstrated finding code examples using plain English descriptions. Tabnine has provided “AI powered” code completion for a few years now. Where Copilot differs is that it can generate entire multi-line functions and even documentation and tests, based on the full context of a file of code.

This is particularly exciting for us at fast.ai because it holds the promise that it may lower the barrier to coding, which would greatly help us in our mission. Therefore, I was particularly keen to dive into Copilot. However, as we’ll see, I’m not yet convinced that Copilot is actually a blessing. It may even turn out to be a curse.

Copilot is powered by a deep neural network language model called Codex, which was trained on public code repositories on GitHub. This is of particular interest to me, since in back in 2017 I was the first person to demonstrate that a general purpose language model can be fine-tuned to get state of the art results on a wide range of NLP problems. I developed and showed that as part of a fast.ai lesson. Sebastian Ruder and I then fleshed out the approach and wrote a paper, which was published in 2018 by the Association for Computational Linguistics (ACL). OpenAI’s Alec Radford told me that this paper inspired him to create GPT, which Codex is based on. Here’s the moment from that lesson where I showed for the first time that language model fine tuning gives a start of the art result in classifying IMDB sentiment:

A language model is trained to guess missing words in a piece of text. The traditional “ngram” approach used in previous years can not do a good job of this, since context is required to guess correctly. For instance, consider how you would go about filling in the missing words in each of these examples:

Knowing that in one case “hot day” is correct, but in another that “hot dog” is correct, requires reading and (to some extent) understanding the whole sentence. The Codex language model learns to guess missing symbols in programming code, so it has to learn a lot about the structure and meaning of computer code. As we’ll discuss later, language models do have some significant limitations that are fundamentally due to how they’re created.

The fact that Copilot is trained on publicly available code, under a variety of licenses, has led to many discussions about the ethical and legal implications. Since this has been widely discussed I won’t go into it further here, other than to point out one clear legal issue for users of Copilot discussed by IP lawyer Kate Downing, which is that in some cases using Copilot’s suggestions may be a breach of license (or require relicensing your own work under a GPL-compatible license):

“The more complex and lengthy the suggestion, the more likely it has some sort of copyrightable expression.”

Walk-through

Before we dive into Copilot more deeply, let’s walk-through some more examples of using it in practice.

In order to know whether that auto-generated write_text function actually works, we need a test. Let’s get Copilot to write that too! In this case, I just typed in the name of my test function, and Copilot filled in the docstring for me:

After accepting that suggestion, Copilot got a bit confused and suggested a meaningless function containing many lines of near duplicate code:

No problem – Copilot can show us other suggested options, by hitting Ctrl-Enter. The first listed actually looks pretty reasonable (except for an odd extra tab character in the first line):

Here Copilot assumes there’s a function available called read_text, which doesn’t actually exist. But it’s easy enough for us to ask Copilot to write and test that for us too.

I also asked Copilot to create a function for me that would “Tar and optionally compress directory dir to dest in bz2, xz, or gzip format”, which resulted in:

def tar_dir(dir, dest): "Tar and optionally compress directory `dir` to `dest` in bz2, xz, or gzip format" if dest.endswith('.bz2'): with tarfile.open(dest,'w:bz2') as f: f.add(dir, arcname=os.path.basename(dir)) elif dest.endswith('.xz'): with tarfile.open(dest,'w:xz') as f: f.add(dir, arcname=os.path.basename(dir)) elif dest.endswith('.gz'): with tarfile.open(dest,'w:gz') as f: f.add(dir, arcname=os.path.basename(dir)) else: raise ValueError('bad dest: {}'.format(dest))

I also had Copilot create a test using the same basic approach as above, and it wrote:

with TemporaryDirectory() as tmp: write_text(os.path.join(tmp,'a.txt'),'hello') write_text(os.path.join(tmp,'b.txt'),'world') tar_dir(tmp,'out.tar.bz2') assert os.path.exists('out.tar.bz2') with tarfile.open('out.tar.bz2') as f: assert f.getnames()==['a.txt','b.txt']

That test doesn’t actually pass, since getnames in the last line includes the parent directory, but it’s an easy enough fix from there. Copilot even cleverly decided to use the write_text function I created earlier, which is something I wasn’t expecting.

You can even use Copilot to write prose. I’m writing this blog post in vscode right now, and just clicked the “enable Copilot” button. After I typed the previous sentence, here’s what Copilot recommended as a completion:

“I can now write my blog post in a single line of text, and Copilot will generate the rest of the post for me”

Clearly Copilot has a rather inflated understanding of its own prose generation capabilities!

Code problems

The code Copilot writes is not very good code. For instance, consider the tar_dir function above. There’s a lot of duplicate code there, which means more code to maintain in the future, and more code for a reader to understand. In addition, the docstring said “optionally compress”, but the generated code always compresses. We could fix these issues by writing it this way instead:

def tar_dir(dir, dest): "Tar and optionally compress directory `dir` to `dest` in bz2, xz, or gzip format" suf = ':' + Path(dest).suffix[1:] if suf==':tar': suf='' with tarfile.open(dest,f'w{suf}') as f: f.add(dir, arcname=dir)

A bigger problem is that both write_text and tar_dir shouldn’t have been written at all, since the functionality for both is already provided by Python’s standard library (as pathlib’s write_text and shutil’s make_archive). The standard library versions are also better, with pathlib’s write_text doing additional error checking and supporting text encoding and error handling, and make_archive supporting zip files and any other archive format you register.

Why Copilot writes bad code

According to OpenAI’s paper, Codex only gives the correct answer 29% of the time. And, as we’ve seen, the code it writes is generally poorly refactored and fails to take full advantage of existing solutions (even when they’re in Python’s standard library).

Copilot has read GitHub’s entire public code archive, consisting of tens of millions of repositories, including code from many of the world’s best programmers. Given this, why does Copilot write such crappy code?

The reason is because of how language models work. They show how, on average, most people write. They don’t have any sense of what’s correct or what’s good. Most code on GitHub is (by software standards) pretty old, and (by definition) written by average programmers. Copilot spits out it’s best guess as to what those programmers might write if they were writing the same file that you are. OpenAI discuss this in their Codex paper:

“As with other large language models trained on a next-token prediction objective, Codex will generate code that is as similar as possible to its training distribution. One consequence of this is that such models may do things that are unhelpful for the user”

One important way that Copilot is worse than those average programmers is that it doesn’t even try to compile the code or check that it works or consider whether it actually does what the docs say it should do. Also, Codex was not trained on code created in the last year or two, so it’s entirely missing recent versions, libraries, and language features. For instance, prompting it to create fastai code results only in proposals that use the v1 API, rather than v2, which was released around a year ago.

Complaining about the quality of the code written by Copilot feels a bit like coming across a talking dog, and complaining about its diction. The fact that it’s talking at all is impressive enough!

Let’s be clear: The fact that Copilot (and Codex) writes reasonable-looking code is an amazing achievement. From a machine learning and language synthesis research point of view, it’s a big step forward.

But we also need to be clear that reasonable-looking code that doesn’t work, doesn’t check edge cases, and uses obsolete methods, and is verbose and creates technical debt, can be a big problem.

The problems with auto-generated code

Code creation tools have been around nearly as long as code has been around. And they’ve been controversial throughout their history.

Most time coding is not taken up in writing code, but with designing, debugging, and maintaining code. When code is automatically generated, it’s easy to end up with a lot more of it. That’s not necessarily a problem, if all you have to do to maintain or debug it is to modify the source which the code is auto-generated from, such as when using code template tools. Even then, things can get confusing when debugging, since the debugger and stack traces will generally point at the verbose generated code, not at the templated source.

With Copilot, we don’t have any of these upsides. We nearly always have to modify the code that’s created, and if we want to change how it works, we can’t just go back and change the prompt. We have to debug the generated code directly.

As a rule of thumb, less code means less to maintain and understand. Copilot’s code is verbose, and it’s so easy to generate lots of it that you’re likely to end up with a lot of code!

Python has rich dynamic and meta-programming features that greatly reduce the need for code generation. I’ve heard a number of programmers say that they like that Copilot writes a lot of boilerplate for you. However, I almost never write any boilerplate anyway – any time in the past I found myself needing boilerplate, I used dynamic Python to refactor the boilerplate out so I didn’t need to write it or generate it any more. For instance, in ghapi I used dynamic Python to create a complete interface to GitHub’s entire API in a package that weighs in at just 40kB (by comparison, an equivalent packages in Go contains over 100,000 lines of code, most of it auto-generated).

A very instructive example is what happened when I prompted Copilot with:

def finetune(folder, model): """fine tune pytorch model using images from folder and report results on validation set"""

With a very small amount of additional typing, it generated these 89 lines of code nearly entirely automatically! In one sense, that’s really impressive. It does indeed basically do what was requested – finetune a PyTorch model.

However, it finetunes the model badly. This model will train slowly, and result in poor accuracy. Fine tuning a model correctly requires considering things like handling batchnorm layer statistics, finetuning the head of the model before the body, picking a learning rate correctly, using an appropriate annealing schedule, and so forth. Also, we probably want to use mixed precision training on any CUDA GPU created in the last few years, and are likely to want to add better augmentation methods such as MixUp. Fixing the code to add these would require many hundreds of lines more code, and a lot of expertise in deep learning, or the use of a higher level API such as fastai, which can finetune a PyTorch model in 4 lines of code, resulting in something with higher accuracy, faster, and that’s more extensible.)

I’m not really sure what would be best for Copilot to do in this situation. I don’t think what it’s doing now is actually useful in practice, although it’s an impressive-looking demonstration.

I asked the fast.ai community for examples of times where Copilot had been helpful in writing code for them. One person told me they found it invaluable when they were writing a regex to extract comments from a string containing python code (since they wanted to map each parameter name in a function to its comment). I decided to try this for myself. Here’s the prompt for Copilot:

code_str = """def connect( host:str, # host to connect to port:int=80, # port to connect to ssl:bool=True, # whether to use SSL
) -> socket.socket: # the connected socket
"""
# regex to extract comments from strings looking like code_str

Here’s the generated code:

comment_re = re.compile(r'^\s*#.*$', re.MULTILINE)

This code doesn’t work, since the ^ character is incorrectly binding the match to the start of the line. It’s also not actually capturing the comment since it’s missing any capturing groups. (The second suggestion from Copilot correctly removes the ^ character, but still doesn’t include the capturing group.)

These are minor issues, however, compared to the big problem with this code, which is that a regex can’t actually parse Python comments correctly. For instance, this would fail, since the # in tag_prefix:str="#" would be incorrectly parsed as the start of a comment:

code_str = """def find_tags(
    input_str:str,     # the string to search for tags
    tag_prefix:str="#" # prefix marking the start of a tag
) -> List[str]:        # list of all tags found

It turns out that it’s not possible to correctly parse Python code using regular expressions. But Copilot did what we asked: in the prompt comment we explicitly asked for a regex, and that’s what Copilot gave us. The community member who provided this example did exactly that when they wrote their code, since they assumed that a regex was the correct way to solve this problem. (Although even when I tried removing “regex to” from the prompt Copilot still prompted to use a regex solution.) The issue in this case isn’t really that Copilot is doing something wrong, it’s that what it’s designed to do might not be what’s in the best interest of the programer.

GitHub markets Copilot as a “pair programmer”. But I’m not sure this really captures what it’s doing. A good pair programmer is someone who helps you question your assumptions, identify hidden problems, and see the bigger picture. Copilot doesn’t do any of those things – quite the opposite, it blindly assumes that your assumptions are appropriate and focuses entirely on churning out code based on the immediate context of where your text cursor is right now.

Cognitive Bias and AI Pair Programming

An AI pair programmer needs to work well with humans. And visa versa. However humans have two cognitive biases in particular that makes this difficult: automation bias and anchoring bias. Thanks to this pair of human foibles, we will all have a tendency to over-rely on Copilot’s proposals, even if we explicitly try not to do so.

Wikipedia describes automation bias as:

“the propensity for humans to favor suggestions from automated decision-making systems and to ignore contradictory information made without automation, even if it is correct”

Automation bias is already recognized as a significant problem in healthcare, where computer decision support systems are used widely. There are also many examples in the judicial and policing communities, such as the city official in California who incorrectly described an IBM Watson tool used for predictive policing: “With machine learning, with automation, there’s a 99% success, so that robot is—will be—99% accurate in telling us what is going to happen next”, leading the city mayor to say “Well, why aren’t we mounting .50-calibers [out there]?” (He claimed he was he was “being facetious.”) This kind of inflated belief about the capabilities of AI can also impact users of Copilot, especially programmers who are not confident of their own capabilities.

The Decision Lab describes anchoring bias as:

“a cognitive bias that causes us to rely too heavily on the first piece of information we are given about a topic.”

Anchoring bias has been very widely documented and is taught at many business schools as a useful tool, such as in negotiation and pricing.

When we’re typing into vscode, Copilot jumps in and suggests code completions entirely automatically and without any interaction on our part. That often means that before we’ve really had a chance to think about where we’re heading, Copilot has already plotted a path for us. Not only is this then the “first piece of information” we’re getting, but it’s also an example of “suggestions from automated decision making systems” – we’re getting a double-hit of cognitive biases to overcome! And it’s not just happening once, but every time we write just a few more words in our text editor.

Unfortunately, one of the things we know about cognitive biases is that just being aware of them isn’t enough to avoid being fooled by them. So this isn’t something GitHub can fix just through careful presentation of Copilot suggestions and user education.

Stack Overflow, Google, and API Usage Examples

Generally if a programmer doesn’t know how to do something, and isn’t using Copilot, they’ll Google it. For instance, the coder we discussed earlier who wanted to find parameters and comments in a string containing code might search for something like: “python extract parameter list from code regex”. The second result to this search is a Stack Overflow post with an accepted answer that correctly said it can’t be done with Python regular expressions. Instead, the answer suggested using a parser such as pyparsing. I then tried searching for “pyparsing python comments” and found that this module solves our exact problem.

I also tried searching for “extract comments from python file”, which gave a first result showing how to solve the problem using the Python standard library’s tokenize module. In this case, the requester introduced their problem by saying “I’m trying to write a program to extract comments in code that user enters. I tried to use regex, but found it difficult to write.*” Sounds familiar!

This took me a couple of minutes longer that finding a prompt for Copilot that gave an answer, but it resulted in me learning far more about the problem and the possible space of solutions. The Stack Overflow discussions helped me understand the challenges of dealing with quoted strings in Python, and also explained the limitations of Python’s regular expression engine.

In this case, I felt like the Copilot approach would be worse for both experienced and beginner programmers. Experienced programmers would need to spend time studying the various options proposed, recognize that they don’t correctly solve the problem, and then would have to search online for solutions anyway. Beginner programmers would likely feel like they’ve solved the problem, wouldn’t actually learn what they need to understand about limitations and capabilities of regular expressions, and would end up with broken code without even realizing it.

In addition to CoPilot, Microsoft, the owners of GitHub, have created a different but related product called “API Usage Examples”. Here’s an example taken directly from their web-site:

This tool looks for examples online of people using the API or library that you’re working with, and will provide examples of real code showing how it’s used, along with links to the source of the example. This is an interesting approach that’s somewhere between Stack Overflow (but misses the valuable discussions) and Copilot (but doesn’t provide proposals customized to your particular code context). The crucial extra piece here is that it links to the source. That means that the coder can actually see the full context of how other people are using that feature. The best ways to get better at coding are to read code and to write code. Helping coders find relevant code to read looks to be an excellent approach to both solving people’s problems whilst also helping them improve their skills.

Whether Microsoft’s API Usage Examples feature turns out the be great will really depend on their ability to rank code by quality, and show the best examples of usage. According to the product manager (on Twitter) this is something they’re currently working on.

Conclusions

I still don’t know the answer to the question in the title of this post, “Is GitHub Copilot a blessing, or a curse?” It could be a blessing to some, and a curse to others. For those for whom it’s a curse, they may not find that out for years, because the curse would be that they’re learning less, learning slower, increasing technical debt, and introducing subtle bugs – are all things that you might well not notice, particularly for newer developers.

Copilot might be more useful for languages that are high on boilerplate, and have limited meta-programming functionality, such as Go. (A lot of people today use templated code generation with Go for this reason.) Another area that it may be particularly suited to is experienced programmers working in unfamiliar languages, since it can help get the basic syntax right and point to library functions and common idioms.

The thing to remember is that Copilot is an early preview of a very new technology that’s going to get better and better. There will be many competitors popping up in the coming months and years, and GitHub will no doubt release new and better versions of their own tool.

To see real improvements in program synthesis, we’ll need to go beyond just language models, to a more holistic solution that incorporates best practices around human-computer interaction, software engineering, testing, and many other disciplines. Currently, Copilot feels like a product designed and implemented by machine learning researchers, rather than a complete solution incorporating all needed domain expertise. I’m sure that will change.

Read the original article

Comments

By ALittleLight 2021-07-205:345 reply

There's an interesting section that talks about what you want from a pair programmer - questioning your assumptions, spotting errors, debating design with you, etc, and how Copilot doesn't do this, but instead just spits out the code it thinks you want. That made me think that, when using Copilot, the human is actually the copilot. You're the one spotting bugs, picking different designs, and so on.

I also think the discussion around searching versus autocomplete is interesting. What we need is something that autogenerates a Google search based on your code context and shows you some promising results. I'm imagining a pane in VS Code getting populated with Stack Overflow answers whenever I slow down.

By seumars 2021-07-207:012 reply

For what it’s worth, Copilot is not being marketed as anything else than code suggestion software - an alternative to google search, perhaps more of a “I’m feeling lucky” without having to actually type the search term. Knowing how to succinctly express a problem is half the problem after all.

By ALittleLight 2021-07-207:162 reply

It's called "Copilot" and the tagline is "Your AI pair programmer." It's obviously being marketed as more than code suggestion software. It's being marketed as an AI pair programmer.

I agree that it is basically just code suggestion software, or like a big autocomplete. But it's definitely being marketed as more.

By soco 2021-07-207:592 reply

This sounds almost like Tesla's Autopilot, both in naming and in actual performance.

By toxik 2021-07-208:331 reply

Ha, I ranted about this two weeks ago. https://news.ycombinator.com/item?id=27724969

By JohnWhigham 2021-07-2010:571 reply

This is very true, and much like Tesla getting away with it for years (up until recently), GitHub will get away with this for a while as well. And so will the media decrying shit like "Is this the end of developers?!?"

By craftinator 2021-07-2012:33

> And so will the media decrying shit like "Is this the end of developers?!?"

Yeah, CoPilot already replaced my job. I used to be the office copypasta, going around to my coworkers desk all day long to fix their problems by googling them, then pasting in the first snippet to pop up. It didn't matter if the language matched or not, it was more the thought that counted. Now I've had to get a job as a loan rejection stamper at my local bank. It's not nearly as fun or exciting, but at least my effort makes a difference now. A big difference.

By wlesieutre 2021-07-2013:37

Full Self Coding Beta

By nimbius 2021-07-2014:47

the biggest concern and reason to pump the brakes on copilot is the unresolved elephant in the room: the growing consensus that it violates FOSS licenses.

By ahepp 2021-07-209:381 reply

Taken directly from the copilot product page:

>Tests without the toil. Tests are the backbone of any robust software engineering project. Import a unit test package, and let GitHub Copilot suggest tests that match your implementation code.

Accompanied by a picture of copilot filling out the implementation of a unit test, based on the test's name.

The implication seems very clearly that it is more than a google search. According to the ads, it understands your code well enough to write the unit tests for it!

By jasode 2021-07-2011:16

> According to the ads, it understands your code well enough to write the unit tests for it!

Yes, Github Copilot marketing has overstatements -- but it also has caveats about its limitations and creating errors.

In other words, the landing page has a mix of statements ... some hyping up the potential -- and some disclaimers to mention the reality and pitfalls.

Excerpt from that same Copilot landing page about it being wrong more often than right: https://copilot.github.com/

>How good is GitHub Copilot?

>We recently benchmarked against a set of Python functions that have good test coverage in open source repos. We blanked out the function bodies and asked GitHub Copilot to fill them in. The model got this right 43% of the time on the first try, and 57% of the time when allowed 10 attempts.

>Does GitHub Copilot write perfect code?

>No. GitHub Copilot tries to understand your intent and to generate the best code it can, but the code it suggests may not always work, or even make sense. While we are working hard to make GitHub Copilot better, code suggested by GitHub Copilot should be carefully tested, reviewed, and vetted, like any other code.

By tmp_anon_22 2021-07-2019:111 reply

> I'm imagining a pane in VS Code getting populated with Stack Overflow answers whenever I slow down.

Right but realize how much of a marginal gain we're fighting for here. You can already do this with a separate window and some extra clicking and typing. Saving a little clicking and typing isn't a huge win, unless you're also willing to argue the most valuable work a SWE does is the clicking and typing.

This whole thing feels ultimately like a PR spot that we're all tripping over ourselves to participate with. Its the SWE equivalent of getting all excited about how self-driving is just around the corner TM. It's a premature waste of energy until real advancements are made.

By ALittleLight 2021-07-217:14

I think the utility is similar to instant search on Google. I get very little by Google autocompleting the term I'm trying to type. I get quite a bit by Google suggesting things I didn't even know to search for.

The automatic SO search pane I was describing might give the same benefits. Most of the time maybe it would be useless. Sometimes it would show what I knew I wanted and save me the trivial effort of a search. The real gain would come from when it hit something that I should have asked but didn't think to. Then, I'd have a valuable insight I never would've had otherwise sitting in some pane ready for me.

Maybe it's just my imagination - but that's how I think it could be good.

By lucis 2021-07-2012:481 reply

You might want to take a look in https://www.tabnine.com/

By michaelbuckbee 2021-07-2012:511 reply

I've been using TabNine for a while and found it very helpful. It really does seem like more of a hyper-intelligent autocomplete than CoPilot which feels like it wants to write it for you.

By kortex 2021-07-2013:121 reply

The thing I love the most about tabnine is how it learns on your own code base. So if you have particular naming schemes or weird idioms, it eventually picks up on them.

By hboon 2021-07-212:581 reply

I've installed it a few weeks ago but haven't found it very useful yet. May I ask, which language(s) do you use it for and was it worth paying (I paid).

I'm using it primarily for Swift and some Kotlin.

By kortex 2021-07-2112:461 reply

Python, works quite well, golang works super well. C++ and JS are reasonable. Rust, you might as well turn it off. The more popular the language overall, the better it does. Swift and Kotlin might have too little to have refined the model on.

Another thing: I was using the free version in beta so it was effectively the full blown version. It worked super well. I changed jobs, changed hardware, and haven't set up my key yet (hopefully still grandfathered in) but it's felt not as good - the output is a lot shorter, for one. I would likely get the paid version if I lose my beta key. But it's weird that you have the paid version and it's not useful.

It's also possible they nerfed the models over time in the name of some other tradeoff, like cpu or memory. May still be able to configure it. But I distinctly remember it spitting out entire lines or chunks of code, exactly as I would have written them.

It works best when you are writing stuff similar to your own codebase. Perhaps try vendoring some Swift or Kotlin in your project dir to get the engine to fine-tune on it?

By hboon 2021-07-2113:03

Ah thanks. It’s a decent size Swift project. So it should have quite a bit to work off. Maybe it’s the settings. I’ll check what I can tweak.

By carschno 2021-07-205:461 reply

Something like the Stackoverflow Importer for Python? https://github.com/drathier/stack-overflow-import

By wildbook 2021-07-2012:01

It exists: https://github.com/MythicManiac/copilot-import

By jimmygrapes 2021-07-205:55

100% agreed; the copilot is human.

By zaptheimpaler 2021-07-206:0612 reply

Copilot was made from stealing code on Github, ignoring the licenses set on repos such as e.g GPLv2, using AI as a trick to license-launder code.

Copilot has announced their plans to become a paid service.

So this product that would not be possible without public, open-source code will itself be non-public, closed-source, closed-data. It is extracting value from the commons and funneling it to a private company.

By qiqitori 2021-07-209:157 reply

The following is just my opinion, and I'm not that hard-set in case anyone has any nice arguments.

Generally, when you (for example) use GANs to learn from (copyrighted) images and generate new images, I see no reason why those new images should inherit any copyrights from the original image if the resulting images look sufficiently different. (Obviously, if you just train on 1000 images of Mickey Mouse, you'll get mostly Mickey Mouse, and you wouldn't get the copyright.)

Humans work the same way -- artists train on existing images or subjects/etc., and unless they produce something that looks similar to an existing image, they get the copyright.

In other words, I don't think training on copyrighted code violates copyright (UNLESS the license explicitly disallows that; maybe there will be licenses for that soon). However, if generated code is too similar to existing code, then that could be a copyright violation. In other words, the user is responsible to make sure they aren't violating any copyrights from their usage of Copilot. It may be useful to build plagiarism checkers for this purpose. (Or maybe they already exist.)

If you use Copilot, I recommend keeping track of things that were generated automatically, just so you can go back to change those components if necessary.

By xdennis 2021-07-2013:471 reply

> Humans work the same way

This is the fundamental issue. They don't.

You can't sneak a camera into a cinema saying "don't worry, it works just like human eyes".

Neural nets aren't neurons.

By mikewhy 2021-07-2015:40

For a while, recording a movie was absolutely legal in Canada.

By bluesign 2021-07-2013:501 reply

What about I train millions of tagged images, then when I write ‘mickey mouse’ it spits something very similar to ‘mickey mouse’?

This is the what copilot does as far as I understand

By qiqitori 2021-07-2018:031 reply

Out of the replies I got so far, I liked this one best. Unfortunately the story is off the front page already, so I doubt there will be much more discussion.

In my view, artists work the same way. If you ask them to draw Mickey Mouse, they (maybe) will. It wouldn't be fair to say that they are infringing on Disney's copyright by storing images of Mickey Mouse in their brain. But their version of Mickey Mouse won't be copyrightable (unless they add parody/significant creativity etc.).

If we perform an _exact_ simulation of a human brain such that it believes it is a human, how will copyright law work? (Maybe it will own the copyright and turning the simulation off would be robocide. Okay, enough sci-fi.) If we remove consciousness and all that from the simulation, will the copyright go to its creator?

I saw in a couple other comments saying that machine learning is "just an algorithm"... But is machine learning sufficiently different from the way some parts of the human brain work to warrant being held to different standards?

My opinion is that it's reasonably similar and should have the same privileges that humans enjoy -- learning from whatever's in sight that is not explicitly marked as "for authorized personnel only".

By bluesign 2021-07-2217:05

Thanks for the answer, but here is a bit difference though in our understanding of law, I think if I draw Mickey Mouse now, it is derivative work. Disney can sue me. I think there should be some reasonable difference like trademark cases, people will not see it as Mickey Mouse.

But I think eventually we are at level of what is the minimum required duplicate to define something as duplicate. (I think this is more of a problem of music, similar songs etc)

I think main problem is we are not yet there (exact simulation of human brain), we are more like 'convince people to you are exact simulation of brain' stage.

Also another problem here is 'self trained/directed' vs 'trained/directed by someone'. Imagine I have a human artist (never seen Mickey Mouse), and in front of me some Mickey Mouse art, if I am giving directions to draw a mouse, then saying make in cartoon style, then saying make ears bigger, etc. Till I get the Mickey Mouse reasonably similar, even maybe exactly to the pixel in front of me, is it copyright violation, I wouldn't maybe 100% yes, but I am very closer to that.

By _ph_ 2021-07-2016:05

I think the catch is, a human is considered a creative being, as in it can create new content which consequently is copyrighted to that human. An AI cannot - at least so far - not create create genuinely new content and also cannot assume copyright.

If the copilot would have been trained on open source software e.g. to discover bugs, bad code style, or other things "learned" by analyzing existing code and using the results of this analysis as a metric for judging code, it wouldn't be a copyright problem. But creating new code based on what it "learned" is a much more difficult field, especially, if it is "quoting" so literally.

By enumjorge 2021-07-2014:20

> I don't think training on copyrighted code violates copyright

It might not be illegal, but for a lot of us it feels unethical. IMO licenses should list what they allow versus what they forbid. In other words, if something isn’t explicitly allowed it should be assumed to be forbidden.

We need similar protections for personal data too. Tech companies have gotten too used to ingesting data and profiting from it without asking for consent. I’m willing to bet that had GitHub asked repo owners whether GH could use their work to train a new product, most would have said no, much like what happened when Apple asked iOS users if they were ok with Facebook tracking them.

By b3morales 2021-07-2018:38

> In other words, I don't think training on copyrighted code violates copyright (UNLESS the license explicitly disallows that; maybe there will be licenses for that soon).

This was discussed on another thread: https://news.ycombinator.com/item?id=27740001

A new license saying "this license does not permit use in training an AI" won't have any effect, because the claim by the trainers is that they don't need to license the work in the first place. Unfortunately this is likely to be something that's only settled ad hoc in court.

By thereddaikon 2021-07-2015:16

Looks like its just copy pasting in code from ingested repos to me

https://twitter.com/mitsuhiko/status/1410886329924194309

By drran 2021-07-2011:301 reply

"AI training" is not a "training" in human sense. It's algorithm.

> In other words, I don't think using of an algorithm on copyrighted code violates copyright

It does.

By sweetheart 2021-07-2011:373 reply

Do you have a link discussing how it does? Ive only seen an article from a lawyer explaining how it doesn’t, in their professional opinion.

By UncleMeat 2021-07-2012:02

Welcome to software engineering, where engineers have extremely strong and fixed opinions about all sorts of fields that are completely disconnected from software engineering.

By drran 2021-07-2018:00

«A derivative work is based on a work that has already been copyrighted. The new work arises—or derives—from the previous work.

If you own the copyright to a work, you need to be aware that you also have rights to derivative works. If you're considering incorporating someone else's work into your new work, you need to be aware that you may be violating the copyright to the original work.»

https://www.legalzoom.com/articles/what-are-derivative-works...

By sombremesa 2021-07-2012:001 reply

Regardless of the nuances of that point, GitHub copilot violates copyright because the content it was trained on still lives in it, and you can get it to spit it out verbatim.

If I were a master artist, my existence wouldn’t violate copyright, but I would certainly be violating it every time I chose to reproduce a copyrighted work for a client.

By sweetheart 2021-07-2017:161 reply

Okay but do any lawyers agree with you?

By drran 2021-07-2018:03

Why not to ask them directly?

By onion2k 2021-07-207:553 reply

This is just business as usual for tech companies though. How many Silicon Valley businesses built their products on top of open source projects without contributing much back? Heck.. how many YC companies do that? When you see vital open source projects like OpenSSL struggling to raise more than a few tens of thousands in donations (it only continues through OSF contracting work), and libraries like Svelte barely clearing $30k, when you know they're used by Apple, Google, Facebook, Microsoft, etc you can't be that surprised when companies do the same thing on a larger scale.

The entire tech industry and all the multi-trillion and multi-billion dollar unicorn businesses that have billionaire founders and millionaire developers working at them fall under the description of "would not be possible without public, open-source code" and "will itself be non-public, closed-source, closed-data."

We can't reasonably claim there's anything wrong with a company building a product on the back of open source work when literally all of us do that as well. The only difference with Copilot is scale.

By dgb23 2021-07-209:191 reply

> We can't reasonably claim there's anything wrong with a company building a product on the back of open source work when literally all of us do that as well.

We cannot (are not allowed to, by law, contracts etc.) open source everything we write. I think the most important contribution is not funding but to contribute in some way or another for most.

Having said that, it is disheartening that important projects lack even basic funding. Is this getting better? There are also success stories that pop up with funding and donations becoming easier.

For companies open source funding can be a major thing to improve their relationship with the wider software community. It is a signal that they are interested in sustainability of software and collaboration. And it is a form of recognition for the authors.

Edit:

I just realized we are talking about GitHub here. They are a major contributor to open and free software as they provide free hosting and tooling. I'm not saying this absolves them of everything they do with Copilot, but I'm very, very happy to have such an amazing service freely available and many others are too.

By benatkin 2021-07-2013:08

GitHub isn't good for Open Source in my opinion. Debian is right to use their own GitLab instance. Lots of things seem good in the short term and are bad in the long term.

Edit: I don't really know what things would be like without GitHub. Same with YouTube and Facebook. I don't assume any of these have a net positive impact, though.

By yreg 2021-07-209:101 reply

Shaming companies for using OpenSSL for free without any strings attached seems to go against the idea of free software.

By onion2k 2021-07-209:171 reply

No part of any open source license says I have to like companies that extract billions in value from open source projects without contributing back to the project. My opinion is simply that once you get to a few million in revenue putting a few thousand back in to the code that got you there is a decent thing to do, and if you don't then you're not very nice.

The fact that the license allows you to do this is great; the fact that people actually do is not.

By drran 2021-07-2011:361 reply

GPL asks to release back changes to code made by a company, for which the company paid already, i.e. it's almost zero price for the company, and just look how much companies are afraid to donate zero ($0) worth of code back to the opensource project and prefer to pirate it instead, including such mega-rich companies as M$.

By messe 2021-07-2012:481 reply

> GPL asks to release back changes to code made by a company

Only if the code is distributed. SaaS can be (and is) used to get around this.

By grey_earthling 2021-07-2013:08

Hence the AGPL.

By tvirosi 2021-07-209:111 reply

There's a difference between building a product using open source and not contributing back, and copying licensed code into your own codebase. One is rude and the other is straight up illegal.

By onion2k 2021-07-209:195 reply

The code doesn't exist in Copilot. The instructions for how to recreate the code does. On a very pedantic level those are not the same, but it probably is enough to argue that the product is 100% legal.

It is still quite rude to do that though.

By dmurray 2021-07-2011:05

I don't think there's any water in that argument at a fundamental level. By that logic you could encode copyrighted information in any reversible format and it would be OK.

If I copy the code line by line with my eyes and keyboard, then the code briefly doesn't exist, but the instructions to recreate it do. Copying it is then a two-phase process: first I read the code, then I type it in. It is clear that I'm allowed to look at the code, so it's the action of typing it in that violates copyright. In the same way, distributing Copilot doesn't necessarily break copyright, but using it may.

By xdennis 2021-07-2014:04

I don't think people should look at this based on current law, but what is right.

We now need protection against machine learning (learnright?) because the only way to not work for Microsoft for free now is to not release your source code... so not open source. Remember when people said that Microsoft changed its mind on open source?

By IX-103 2021-07-2013:29

There's an argument that the model is a derivative work. In that case the original copyright still applies.

By dTal 2021-07-2010:23

So a .zip file of copyrighted code loses its license?

By hvis 2021-07-2011:071 reply

Did "instructions" recreate this?

https://twitter.com/stefankarpinski/status/14109710611816816...

By UncleMeat 2021-07-2012:04

Yes. Models are much much smaller than compressed training sets, making it very clear that they are doing more than just compressing the entire training set.

By FeepingCreature 2021-07-207:563 reply

Copilot generally (excepting rare cases where it produces snippets verbatim) does not steal code. The GPL restricts distribution, not usage. And (to my knowledge) no open-source license restricts learning from code. I cannot see anyone who doesn't want others to learn from their code ever release code as open-source.

By toxik 2021-07-208:396 reply

I as an open source author absolutely do not want Microsoft to get richer from using my code, code that I contributed or published for the benefits of other developers.

They took my work, removed my name and trained an advanced pattern matching technique to try to make code like mine and then sell it. It’s so obviously ethically questionable it’s insane.

Developers are absolutely pissed about this, and rightfully so.

By necovek 2021-07-208:562 reply

Not even copyleft licenses prohibit somebody from earning money from what you released, and that includes Microsoft. The idea behind free software is that it benefits all users equally, even if other developers get the biggest direct benefit.

The best question to ask yourself is if you would be annoyed as much if a company like Black Duck did a similar training or analysis with their OpenHub (openhub.net)?

I think one could even make a case for training an AI in this manner from the leaked Windows code: copyright law treats these generally as "fair use", though how you gained the copy of the code might still be illegal.

IANAL though :)

By drran 2021-07-2011:49

You need to comply with license first before you can use it to defense your position.

Copilot doesn't comply with opensource licenses, so authors of Copilot lost the right to use opensource licensed code permanently, until they settle the case with authors of the code.

By wffurr 2021-07-2017:591 reply

CC-BY-NC explicitly prohibits commercial usage.

It’s also common to see GPL license for non-commercial usage and paid licenses for commercial usage.

By necovek 2021-07-2019:33

Dual licensing with a copyleft license is common if you want to offer an ability for someone to develop a closed-source project: they can perfectly develop a GPL-licensed commercial project without paying anything.

If CC-BY-NC prohibits commercial use, it is not an open source or free software license (which have compatible definitions, but differ in motivations).

AFAIK, Creative Commons was set to create a set of licenses in the spirit of open source for creative works, and I wouldn't expect them to be open source at all.

By nanagojo 2021-07-209:271 reply

> I as an open source author absolutely do not want Microsoft to get richer from using my code

You are likely using the wrong license then.

By drran 2021-07-2011:51

Copilot doesn't comply with all open source licenses. Which one we should use then to protect our rights?

By nmfisher 2021-07-209:102 reply

Did you license your code under terms that allowed them to do so?

By tyingq 2021-07-2012:58

Most licenses would require attribution and some notion of the license(s) of the code behind the suggestion.

By toxik 2021-07-2018:13

Screwing over the little guy because he didn’t spend enough time contemplating possible legal troubles with his OS software seems, again, ethically dubious at best.

By iaml 2021-07-2013:01

I wonder if they could generate the correct license to code copilot produces, and maybe even infer the preferred one from repo and generate code that is restricted to that?

By twic 2021-07-209:491 reply

It doesn't matter what you want. You released your code under an open source license. What matters now is what the license says.

By b3morales 2021-07-2020:04

Well, the license probably requires attribution. Can you point to Copilot fulfilling that requirement?

By nojito 2021-07-208:431 reply

Then you should have never released it.

Fairly straight forward solution to your very unique problem.

By toxik 2021-07-2018:181 reply

The license for my software was written in a world where AI was not being used to replace me with my own code. Whatever license was chosen, was chosen to deal with the questions and issues known at the time.

It’s such a BS argument to say “your license didn’t anticipate the future, it’s your fault.” No, that’s not how law works.

Furthermore, law is not ethics. I said it’s ethically questionable because that’s what matters. Not if a court will find Microsoft guilty of some kind of overreach.

Anybody with even passing knowledge of law knows this, so please, stuff it somewhere.

By FeepingCreature 2021-07-214:51

> It’s such a BS argument to say “your license didn’t anticipate the future, it’s your fault.” No, that’s not how law works.

It literally is though.

You don't get to change your mind on an agreement because something happened that you didn't expect.

By xdennis 2021-07-2014:16

> Copilot generally (excepting rare cases where it produces snippets verbatim) does not steal code.

Rare exceptions are not acceptable in other situations.

If you, on rare exceptions, include copyrighted songs in your YouTube videos you still get strikes.

Citibank couldn't recover 900 million dollars it transferred too soon just because it was a rare mistake.

Microsoft shouldn't get a pass.

By garmaine 2021-07-209:042 reply

The GPL very much restrict derivative works. It's the whole point of the GPL. "Usage" in the context of the GPL does not have the meaning you are using.

By goodpoint 2021-07-2011:30

MIT/BSD also restrict derivative works by requiring attribution. Something that Copilot disregards.

By nanagojo 2021-07-209:291 reply

It is way more nuanced than that. For example if you never redistribute your work that was a fork from GPL code, then GPL states it's ok to never give back the source.

By garmaine 2021-07-209:41

What we both said is compatible and consistent. The derived work is restricted by the GPL's provisions. Those restrictions just don't require you to distribute the source on demand unless and until you distribute derivative works to other users.

By arvindamirtaa 2021-07-209:292 reply

I'm open to new perspectives. But here's where I stand so far.

If I learn programming from a book that is copyrighted and use a small snippet (for example, how to do a particular kind of sort) from it in my own program, am I violating a copyright?

By drran 2021-07-2011:57

Read the book copyright statement and license. Some books have separate license for examples.

However, if you copy an example from a book into a code, then very often it's fair use, but if you copy the same example from a book into your own book, then very often it's a copyright infringement, unless explicitly allowed by the book license.

By ekster 2021-07-209:32

Does the license of the book say that if you copy that snippet you can, but should provide attribution, and you don’t?

By ashtonkem 2021-07-2012:08

Turning the free contributions of an enthusiastic community and turning it into private, closed source wealth is a bit of a Microsoft tradition. Arguably Bill Gates did exactly the same thing with Microsoft; it was picked by IBM specifically because the community recommended it, a community despised by Gates who thought they were “stealing” from him.

By sweetheart 2021-07-2011:43

Using the word “extract” seems misleading to me. It has connotations of removing something, or exploiting scarcity. When we extract water from the Earth, there is less there for others to use. But in this case nothing is being removed from the code they trained on. I don’t mean to have the argument devolve into mere semantics, but I really think the use of that word demonstrates an assumption about the issue: people are perceiving a loss to the folks whose code was trained upon.

By elviswolcott 2021-07-206:191 reply

While I agree with your concerns about licensing and copilot, this criticism doesn't seem particularly relevant to the article that was shared.

By zaptheimpaler 2021-07-206:341 reply

IMO any place where Copilot is mentioned is a relevant place to put this. I don't know how anyone working in software can just turn a blind eye to shit like this. Anyone who uses Copilot is implicitly endorsing this theft.

We have a duty as practitioners in the industry to call it out when we see something wrong. If even devs aren't calling bullshit on Copilot, the media won't care, courts won't care, and it will be declared legal, and future theft will be normalized.

Its ridiculous how we all see the big tech companies doing various kinds of terrible shit and then the next new shiny thing comes along and everyone forgets all about it? Are you goldfish? What will it take to get someone to actually give a shit and stop supporting this kind of product/behavior?

By fraktl 2021-07-207:561 reply

> Anyone who uses Copilot is implicitly endorsing this theft.

Just like you're endorsing being sponsored and effectively stealing GitHub's bandwith? :)

> We have a duty as practitioners in the industry to call it out when we see something wrong.

It's wrong both ways. You accepted the service from GitHub, the free one where you get to open your account, host your code, share it etc.

What exactly did you expect? To be served for free for your entire lifetime?

Please, get off the moral highground. You ate the devil fruit, now you're whining about it. You should be smart enough to know that this kind of whining will get you nowhere. Just quit.

By zaptheimpaler 2021-07-208:032 reply

Basically you are saying any product thats free should be able to break the law arbitrarily? Google can decide to post your all email online, your location history for all time, your photos and you would be ok with that because its free? The undoubtedly free DNS server you use can leak all your requests too? You're ok with that?

Yes its free, no they did not tell me or ask for my permission before using the code in this way. Free does not give them the right to break existing laws and licenses. Its pretty simple.

By mariusor 2021-07-208:422 reply

However you did give them permission to use your code by the fact that you acceded to their terms and conditions[1] when you created an account. IANAL, I don't know if this section would hold to scrutiny in a court of law, but I'm pretty sure this is what their legal team considers to cover them when it comes to training Copilot on code hosted with them.

[1] https://docs.github.com/en/github/site-policy/github-terms-o...

By steerablesafe 2021-07-208:501 reply

People routinely share code on Github that is not owned or at least not fully owned by them, so they can't really rely only on the ToS.

By mariusor 2021-07-209:241 reply

See one paragraph above what I initially linked. They cover that also.

By steerablesafe 2021-07-2013:03

IANAL, but that is not my reading. They cover "Your Content" with the license grant, but not "any Content". The user still has the right to post "any Content" if they have the appropriate license to do so, but obviously they can't grant additional licenses to content the user doesn't own.

In my understanding your reading is that users uploading code that they don't own the copyright to, but otherwise have the right to copy through a license, are in violation with the ToS in general.

My reading is that the license grant only applies to "Your Content" as defined in the ToS, and otherwise users are free to upload code with permissive license and it _does not grant_ additional licenses to Github.

By b3morales 2021-07-2020:141 reply

The TOS is not a blanket grant for them to do anything they like with the material. As I said elsewhere: https://news.ycombinator.com/item?id=27823862

> Certainly the GitHub TOS grants them some common-sense ability to copy the code you upload so that they can usefully host it. Can you point to the portion that allows them to use it for Copilot?

> Because I'm pretty sure it doesn't. Section D4:

> > This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service...

By junar 2021-07-2023:40

> You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time

> The “Service” refers to the applications, software, products, and services provided by GitHub, including any Beta Previews.

By toxik 2021-07-208:421 reply

Google actually did do Copilot for Gmail. Nobody noticed though.

By RandomBK 2021-07-209:37

That's actually a great point. Ditto for GDocs. I've been pleasantly surprised at how good autocomplete suggestions have been in docs lately.

If I were to hazard a guess, I'd say that the vitriol around Copilot stems from five factors that distinguishes it from Google:

(1) The length of the suggestions alongside some of Copilot's marketing demonstrated that perhaps non-trivial replacement of engineers with AI might not be as far-fetched or far away as most people thought. Google's autocomplete has yet to make me feel replaceable.

(2) The content of the training data had a clearer intrinsic commercial value, making perceived license violations feel more 'real'.

(3) GitHub (historically) didn't have the same reputation as Google for training AI models on data uploaded to its free services. People likely (mis)placed some trust in GitHub when they uploaded code, and this backlash is part of the adjustment process.

(4) The indication that Copilot will eventually be a paid commercial service, effectively building a commercial service off the backs of millions of open source developers. While this is perfectly legal and common across all industries, it doesn't feel good.

(5) Copilot spitting out raw training data really doesn't help its image.

By icebraining 2021-07-208:473 reply

Which repo marked as GPLv2 has been used on Copilot? I think the trouble is that some repos marked as MIT/BSD actually contain GPL code.

Not that this excuses GitHub/Microsoft in any way, this was an obvious outcome and they're morally and legally responsible.

By TeMPOraL 2021-07-2011:46

Honestly, I think all this talk about GPL vs MIT/BSD is a red herring.

It doesn't matter whether the code is GPL or MIT or BSD. If Copilot reproduces it in your codebase, you're violating the license anyway - almost all FLOSS licenses carry an attribution requirement, which Copilot does not and can not reproduce[0].

The difference between GPL and MIT is whether you have to release your source code, or just add a blurb in README. It's a big one, but it's downstream from the core problem: with Copilot, you won't even know when you're violating some license - much less what to do about it.

[0] - The whole point of a DNN model is to pre-mix the inputs in training, so that responses to queries are cheap and fast. This comes at the cost of making it impossible to reverse-query the model, so the only way for Copilot to give correct attribution would be to take its output and run a search on the training data, which would kill all the costs savings they won by using a neural network.

By ivanbakel 2021-07-2011:48

As the other commenter pointed out, Copilot has ingested all the public code on GitHub, including GPL code.

And as this famous example[0] shows, the GAN is able to reproduce what is unquestionably copyrighted material from those repos.

[0]: https://twitter.com/mitsuhiko/status/1410886329924194309

By xomateix 2021-07-209:01

According to GitHub support, they didn't exclude any repo based on the license: https://news.ycombinator.com/item?id=27769440

By kangalioo 2021-07-207:312 reply

While I see how Copilot's data gathering can be considered unethical, in the end I ask myself, does it matter? Would it have been of any concrete advantage to open-source programmers if Copilot hadn't used their source code? I can't think of any

By tluyben2 2021-07-207:401 reply

Well, if you chose a viral license like the GPL, you do not do that to have your code and knowledge being reused in a non viral (possibly closed source) licensed solution. So this is a fundamental issue at least GPL minded authors will definitely mind and fight against.

By Dylan16807 2021-07-208:26

If you want to control knowledge you need a patent, not a copyright license.

(And the design of copilot is an attempt to extract only knowledge.)

By grp000 2021-07-206:15

I think a lot of artists and content creators would also like a word.

By tffgg 2021-07-209:48

Well, sue them if you think they stole it, and see if you are right

By fraktl 2021-07-207:323 reply

So what you're saying is that you have the right to host your code for free, to download it from GitHub for free (no bandwith paid) and to behave like new-age Robin Hood when it comes to being offended because the very same company use that code - not to run it, but analyze it. Stop being entitled, it's seriously hard to participate in any kind of progress with offendable cauliflowers like you.

By RandomBK 2021-07-208:27

Conversely, the generous olive branch of free hosting is not a blank check that allows GitHub to use the hosted code for any purpose, especially when that purpose wasn't made clear as part of the terms of the original free hosting offer.

When I uploaded my code to GitHub, it was done so with the understanding that in exchange for the hosting and bandwidth, GitHub was permitted to use the code in a set of limited ways, as spelled out in their terms of service. I understood that I was contributing to building and establishing GitHub's brand as the go-to place for open source collaboration, a brand which they have undoubtedly benefited from.

With Copilot, GitHub has extended that use in a way that was not made clear during that initial contract. Regardless of the legality of this change, it's normal and expected for some users to be "offended". This isn't "being entitled", but a legitimate response to what many perceive as a violation of the norms of this industry.

That doesn't even get to the ambiguous legal questions involved, particularly with licenses that go beyond the typical MIT/GPL licenses. Based on GitHub's statements, it sounds like any public repo was fair game. What does this mean in the context of AGPL and other more restrictive licenses?

By zaptheimpaler 2021-07-207:54

No I am happy to pay for it, and have paid for github in the past in fact. That doesn’t mean they can change the deal on what free hosting means with out any notice or method to opt out. Paid user’s public repos were not spared from Copilot either.

By lawtalkinghuman 2021-07-2010:01

The hosting of open source code on GitHub is not some completely selfless act on their part. GitHub's value proposition to commercial users is in part contributed to by the fact it is used by a lot of open source projects and for solo or hobby projects, thus breeding familiarity with the platform.

By eminence32 2021-07-204:401 reply

> The code Copilot writes is not very good code.

This has been my experience as well. Some of the early promo material for Copilot showed someone writing a function signature with docs, and having Copilot write the entire body. This rarely works for me, except for fairly trivial functions. This is also not how I normally write code.

However, where Copilot has been rather good is offering intelligent "tab completion", where it makes a suggestion for the next line of code. In particular, it seems very good at figuring out the structure of my code and making sensible suggestions.

I suspect that if anyone ends up using Copilot for the long term, they will (as a human operator) need to learn and gain intuition about how to use Copilot's strengths and avoid its weaknesses.

By skohan 2021-07-206:073 reply

Copilot is also almost certainly going to become better.

For example, NVIDIA's DLSS was viewed as a failure at launch, but now it's almost magic in terms of offering better results at a lower performance cost.

By outsomnia 2021-07-206:142 reply

Better at issuing haphazardly modified boilerplate that already exists... the real costs of writing code are about carefully choosing how pieces will fit together and deciding on minutae with that in mind.

Copilot is the antithesis of designing good software.

By UncleMeat 2021-07-2012:08

Both parts matter. Making humans faster at writing the well isolated implementations means more time for the important stuff. Tons of productivity features could be foolishly dismissed with "well, the actually important bit is design so this is worthless".

By skohan 2021-07-207:011 reply

AI is getting better than humans at a lot of complex task over time. Just because it can't code well now doesn't mean it won't in the future.

By krageon 2021-07-208:211 reply

What it means that it can't code well, so there's nothing to be excited about. That we cannot predict the future is of course true but really has no bearing on what you're saying, except to lend credence to the idea that it will be able to some day (i.e. that you can predict the future). Which is in essence the opposite of what "we cannot predict the future" should mean. That makes the entire argument kind of disingenuous.

By skohan 2021-07-2011:041 reply

We can't know the future, but we guess based on past trends. Technology has been pretty good at replacing humans in the past, and a lot of times we've said "a machine can't do this" and been wrong. We'll have to see, but I think it's silly to say it's not working now so it never will.

By krageon 2021-07-2014:441 reply

We've heard breathless claims that AI will replace people coding ever since the advent of AI in some form or another. If we truly were going on past trends, we would conclude it is not very useful and will probably stay that way.

By skohan 2021-07-2015:471 reply

Do you think Copilot doesn't represent a step forward since the advent of AI?

By mikro2nd 2021-07-2014:451 reply

It might also become worse. It seems to me that GitHub has opened themselves (well, this Copilot thing anyway) up to a form of DoS attack: Just imagine what might happen if a bunch of people opened up a number of public repos filled with utter crap?

Not suggesting anything, here, just musing on a possibility.

By skohan 2021-07-2015:491 reply

Well, that's a solvable problem. You just need a classifier to filter good code from bad. I would imagine they're already using analytics on the popularity/usage of the repos to weight the training data.

By mikro2nd 2021-07-239:04

LOL - I'm not sure about many humans' ability to tell good code from bad, given how prolifically they produce "bad" code.

By hippari 2021-07-207:591 reply

To me graphic seems to have a very specific goal of rendering more details & better simulation of real world.

People aren't even agreed on what is "good" code.

By skohan 2021-07-2015:561 reply

Yeah but only software engineers care about "good" code. Everyone else just wants their software to do it's job. If you can train AI to write code which is going to get the right outputs, it will win.

AI can probably also beat humans in terms of things like performance optimization. An AI can write code a human never would. We have to write short functions and well-factored code which fits in the human brain. An AI could theoretically write millions of lines of highly-specialized spaghetti which is perfectly correct and perfectly optimized.

By hippari 2021-08-039:59

Optimization AI can goes into the compiler, the source code is for human readability.

or actually they shouldn't, since it can cause undefined behaviors.