Nixers project: Bittorrent library

z3bra
Grey Hair Nixers
Hello fellow nixers!

I just started a new project, and thought it would be great to involve more people in it!
This project is a bittorrent library in C, as there doesn't seem to be much out there (even though many people reimplement the protocol over and over). Here is a link to the project:

http://git.nixers.net/libgbt

The goal is to implement the full BTP/1.0 RFC and add on top of that the ability to handle the DHT protocol.

I know it might sound like a big project to some of you, but it is a good occasion to get started with C, or with a collaborative project!

Who's up for it?
BANGARANG, MOTHERFUCKER
r4ndom
Members
I would love to, but I currently have exams until mid-August.

After that I'm trying to catch up and help :)
Tmplt
Long time nixers
Cool! Is the plan to extend synk(1) with this?

I'm definitely up for it. It's about time I learn C.
z3bra
Grey Hair Nixers
That's the plan, but the library will be an all-purpose lib, so we'll write it without thinking about synk(1) :)

I started writing a bit of code, be sure to check it out!
Tmplt
Long time nixers
I did, but I believe I need to (at least) skim through the RFC.

Is "gbt" an abbreviation for grizzly bittorrent?
z3bra
Grey Hair Nixers
(30-07-2017, 07:31 AM)Tmplt Wrote: Is "gbt" an abbreviation for grizzly bittorrent?

Indeed!

My plan was to implement the RFC from top to bottom, so I started with bencoding.
I didn't bother much checking the rest of the RFC.
Tmplt
Long time nixers
(30-07-2017, 09:15 AM)z3bra Wrote: My plan was to implement the RFC from top to bottom, so I started with bencoding.

That's a smart way of doing it. I'll see if there is anything I can do when I have access to my computer again.
z3bra
Grey Hair Nixers
I started writing some tests for the bencoding_parse() function, but they're not reliable enough, so it might be a good idea to write the tests first.
z3bra
Grey Hair Nixers
The library can now parse integers, strings and lists in "Bencoding" format (check the RFC).
I added a test case as well, which reveals a segfault whenever an incorrect char is parsed (this should be fixed obviously). I also don't free a single bit of memory yet (I know, I know...), so we need a proper way to free a whole bencoding data struct.
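For reference, here is what those types look like on the wire (the dictionary isn't parsed yet, it's included for completeness):

Code:
i42e              # integer 42
4:spam            # string "spam" (length-prefixed)
l4:spami42ee      # list [ "spam", 42 ]
d3:cow3:mooe      # dictionary { "cow": "moo" }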

Something hard as well: NAMING. Functions are "bencoding_*", variables/types "be*"... We need consistency, but I don't feel like using the full format name for variables; it would become unreadable.
As the next sections are THP and PWP, I was considering "ben" as a prefix for everything. I'm just wondering if that makes sense...
venam
Administrators
(30-07-2017, 08:45 PM)z3bra Wrote: I also don't free a single bit of memory yet (I know, I know...), so we need a proper way to free a whole bencoding data struct.
Have you thought about including a simple garbage collector such as libgc ( http://www.hboehm.info/gc/ ). It might not be as flexible as freeing the memory manually but it's still handy.
z3bra
Grey Hair Nixers
No, I didn't think about it. I want to avoid external dependencies as much as possible, except for things I cannot handle myself (HTTP protocol, crypto, ...).

Freeing memory is a basic concept, so I don't feel like pulling in an external library for the sake of being lazy :)
jkl
Long time nixers
Usually, your compiler's static analyzer (you do use a decent compiler, don't you?) should find forgotten mallocs for you and annoy you to fix them. Having a library to do that is only one step behind just using Java or something. /s

--
<mort> choosing a terrible license just to be spiteful towards others is possibly the most tux0r thing I've ever seen
z3bra
Grey Hair Nixers
(31-07-2017, 04:13 AM)jkl Wrote: Usually, your compiler's static analyzer (you do use a decent compiler, don't you?) should find forgotten mallocs for you and annoy you to fix them. Having a library to do that is only one step behind just using Java or something. /s

I use valgrind to hunt forgotten mallocs. It works well enough for my needs. You're right about the lib, and I don't see the point of using a library to do garbage collection when some languages offer it natively.
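For the record, hunting them is as simple as running something like this against the test binary (./tests is just an example name):

Code:
valgrind --leak-check=full ./tests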
sff
Members
What a cool-looking project! As someone who knows a little C (I was taught C++ at uni last year), is there anything I can do to contribute? Or should I try to get more accustomed to C before trying to help? I'm looking forward to seeing the evolution of this project.
z3bra
Grey Hair Nixers
No need to be a pro in C. Start reading what is already there, and ask questions about things you don't understand; it will help others. Then fire up your editor and start coding!
r4ndom
Members
Isn't the dictionary missing in your implementation?

And do we create branches or are we working on master?
z3bra
Grey Hair Nixers
(01-08-2017, 05:31 PM)r4ndom Wrote: Isn't the dictionary missing in your implementation?

And do we create branches or are we working on master?

A lot of things are missing! Dictionaries are next on my list. Actually, they're the same thing as lists, except that there is a relation between each pair of members. So the difference really lies in the usage you make of them, rather than in how you store them. For reference, this commit adds the dictionary parsing functionality :)

What needs to be implemented now is searching through a dictionary to find a specific key.
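Something along these lines, perhaps (hypothetical names and struct layout, not the actual libgbt code):

Code:
#include <string.h>
#include <sys/queue.h>

/* hypothetical types, not the actual libgbt structs */
struct bdata {
    enum { BINT, BSTR, BLIST } type;
    char *str;                      /* set when type == BSTR */
    TAILQ_ENTRY(bdata) entries;
};
TAILQ_HEAD(blist, bdata);

/*
 * A dict is stored as a flat list of alternating key/value
 * elements; the pairing only exists in how we walk it.
 */
struct bdata *
dictsearch(struct blist *dict, const char *key)
{
    struct bdata *k, *v;

    for (k = TAILQ_FIRST(dict); k; k = TAILQ_NEXT(v, entries)) {
        if ((v = TAILQ_NEXT(k, entries)) == NULL)
            break;                  /* dangling key: malformed dict */
        if (k->type == BSTR && strcmp(k->str, key) == 0)
            return v;               /* the value follows its key */
    }
    return NULL;
}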

As you can see, I'm working directly on master. I prefer linear histories, and rebasing commits upon merging is always a pain (also, stagit won't show the branch history, but that's another story). I'd advise you to work on master too, unless you want to try a new feature that might not end up on master (eg. it would be painful to roll back), or if you want to work separately for a later review (which might be great if you're not confident enough in your code).

The end goal is to keep the master easy to browse, so people joining can easily replay the project history with "git log".
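In practice, that mostly means rebasing instead of merging when you pull:

Code:
git pull --rebase    # replay your local commits on top of upstream master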
z3bra
Grey Hair Nixers
I finally implemented key search in dictionaries. That was way easier than expected ^^

I also wrote a TODO file, with a summary of what (I think) should be done. You can pick a task from there and start working if you want.

I guess we can use this thread to discuss who's doing what for now?
For anyone that's new to C, I'd suggest doing the "bencode" function :) It's already written in the test case, but in a quick and dirty way. The goal is to implement it in the library directly, and expose it through the API.
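To give an idea of the scale, encoding a single integer is about this much work (hypothetical signature, the final API is up for discussion):

Code:
#include <stdio.h>

/* hypothetical signature, the final API is up for discussion */
int
bencodeint(FILE *stream, long n)
{
    /* an integer is encoded as "i<base 10>e", e.g. i42e */
    return fprintf(stream, "i%lde", n) < 0 ? -1 : 0;
}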
Who feels up for it? (I can help if needed, no worries)
r4ndom
Members
(02-08-2017, 03:53 AM)z3bra Wrote: I finally implemented key search in dictionaries. That was way easier than expected ^^
Nice. But what I don't get: there are `parseint`, `parsestr`, and `parselist` functions, but why no `parsedict`? I know parselist and parsedict are somewhat the same, but isn't it counterintuitive to have a parselist and no parsedict?
Or is this irrelevant, since what is currently written is just the logic within the library, and the API is the point where verbosity matters?

(02-08-2017, 03:53 AM)z3bra Wrote: Who feels up for it? (I can help if needed, no worries)
Sure, I can do it. But beforehand I have some questions:
  • Is there any short explanation/tutorial for a TAILQ (besides the manual)? Some things aren't clear to me (like when I init a TAILQ, where is it stored?). When I'm home I can skim through my C book to find something, but maybe there isn't anything about it (it's "Learn C the Hard Way").
  • How do I implement an API? What are the important parts of it? I'll probably take a look at wmlib, which probably has one. But a short explanation is always appreciated :)


_____
Thanks for fixing the dictionary typo :P
z3bra
Grey Hair Nixers
You could always create something like this:
Code:
struct blist *
parsedict(FILE *stream)
{
    return parselist(stream);
}

But IMO it clutters the code, and anyway "bparselist" is not a good name for exposing it. I was considering a more general name, à la "bdecode()" (either renaming bparselist(), or making it a wrapper).

For TAILQ, the man page is good enough IMO, but you need to understand the concept of linked lists first.
We can discuss it on IRC if you want.
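To answer the "where is it stored?" part quickly: the head is just a struct you declare yourself (stack, heap, global, wherever). A minimal, self-contained example of the queue(3) macros:

Code:
#include <stdio.h>
#include <sys/queue.h>

struct node {
    int value;
    TAILQ_ENTRY(node) entries;      /* linkage lives inside each node */
};
TAILQ_HEAD(nodelist, node);         /* declares the head type */

int
main(void)
{
    struct nodelist list;           /* the head itself: here, on the stack */
    struct node a = { .value = 1 };
    struct node *np;

    TAILQ_INIT(&list);                      /* empty list */
    TAILQ_INSERT_TAIL(&list, &a, entries);  /* append a node */

    TAILQ_FOREACH(np, &list, entries)
        printf("%d\n", np->value);
    return 0;
}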
For the API, it's just about finding good names for functions and deciding how they are used (it's more a discussion than actual code, though).
xero
Long time nixers
i never trust HDT
always russian hax0rz in that pool
z3bra
Grey Hair Nixers
Then that's a good way to prove this library is strong as a grizzly!
r4ndom
Members
(02-08-2017, 07:27 PM)xero Wrote: HDT
HDT? What's that?
z3bra
Grey Hair Nixers
A typo, he meant DHT (distributed hash table)
josuah
Long time nixers
It could be good to have a small client along with it, just like the bearssl library has brssl. At least for testing whether it works across the whole chain.

If libgbt needs other libraries, this opens a library chain: someone writes a wrapper library over libgbt to make building a client more "dev friendly", which is in turn used in a GUI widget to stream videos, which is used in a larger program.

What about an open and free implementation of sha1.c and md5.c in the repo?

Maybe it need not support md5 at all, as it is quite deprecated in the RFC.
z3bra
Grey Hair Nixers
Having a client shipped with it is indeed a good idea. The library is too young yet though.
It's funny you're talking about free impl. of SHA1, as I just started exporting sha1.c from libtomcrypt!
For MD5, we could do the same if needed.
josuah
Long time nixers
(05-08-2017, 06:18 PM)z3bra Wrote: The library is too young yet though.

A tool to produce a bencoded .torrent metainfo file is already useful, and will let us test bencoding and sha1 against other software right away. A first little reward. :)

(30-07-2017, 08:45 PM)z3bra Wrote: "ben" as a prefix for everything

"ben" prefix is fine for me; maybe maintaining a list of prefixes in a comment would make it easier for devs new to the project?

[EDIT]: .torrent metainfo file != torrent file
z3bra
Grey Hair Nixers
r4ndom is working on the bencode() function right now. I wrote a small program to dump a torrent's content to stdout, but it's not really useful. For such tools, I don't think it's a good idea to include them in the repo, as we'll end up with many tools for no practical reason. I prefer keeping them outside the tree until the lib is feature complete.

As for the prefixes, I settled on simply 'b' for bencoding-related functions. And no, a list of prefixes in a comment shouldn't be needed, as it should be obvious when reading the existing code. That's also why the first patches someone submits should be reviewed by existing contributors.
josuah
Long time nixers
(06-08-2017, 07:10 AM)z3bra Wrote: As for the prefixes, I settled on simply 'b' for bencoding-related functions

Perfect :)

(06-08-2017, 07:10 AM)z3bra Wrote: to include them in the repo

Maybe some could be turned into tests.

It may be a bit early to think about it, but this came up on its own while reading about bittorrent.
Once we start to transfer data from a peer, how do we store it? Here is a proposition:

One approach is to store the parts as files and directories: parts get downloaded to a memory buffer, and once one is complete, it gets saved to disk as <hash of torrent>/<hash of the part>:

Code:
|-- 2072a695613e5103d9ac03c2885c5e2656cb5ff0      # hash of the torrent #1
|   |
|   |-- 80c50e142f978130d9a69b4a15267896f0d72abe  # parts of the torrent
|   |-- d345da49668dbfc05d559e18e5d5975951fc41ac  # named after their hash
|   `-- 376456435d4ceec3acb6ab963107280ef80aca1b  # one file per part
|
`-- 903e39fd73579cfe5a2d97daa6ec9bcc61cd01cc      # hash of the torrent #2
    |
    `-- 376456435d4ceec3acb6ab963107280ef80aca1b  # has the same part as #1

Advantages:
  • permits multiple workers (threads, processes...) to read the torrent parts at the same time.
  • very intuitive and transparent for the end user or dev writing a client: saving part <phash> of torrent <thash> == saving to somedir/thash/phash
  • very easy to check the integrity of the parts:
    filename_of_part == computed_hash_of_part
  • low memory usage: only the piece currently being downloaded needs to be cached in a memory buffer; the rest goes to disk.

Disadvantages:
  • memory usage could be improved by storing data into files as it gets received, at the price of atomicity; or by keeping a file listing the complete parts, but that is less intuitive.
  • parts that are shared across multiple torrents get downloaded twice.
  • what other problems? do you want another design?

To overcome this, it would be possible to store every single part in a single directory shared by all the torrents, but then a race condition could occur: if two processes/threads download the same part at the same time, the first one writes it to the disk.

Instead, before starting a download, a worker could look in the part directories of the other torrents to see if the part already exists.

I would go with the simplest way, one directory per torrent, which still permits optimizations.
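To make the layout concrete, saving a completed part would look something like this (hypothetical helper, assuming somedir/<thash>/ already exists):

Code:
#include <stdio.h>

/* hypothetical helper, not a libgbt function */
int
savepart(const char *dir, const char *thash, const char *phash,
         const void *buf, size_t len)
{
    char path[4096];
    FILE *f;

    /* somedir/<hash of the torrent>/<hash of the part> */
    snprintf(path, sizeof(path), "%s/%s/%s", dir, thash, phash);
    if ((f = fopen(path, "wb")) == NULL)
        return -1;
    if (fwrite(buf, 1, len, f) != len) {
        fclose(f);
        return -1;
    }
    return fclose(f);
}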

[EDIT] torrent != parts of a torrent
josuah
Long time nixers
After talking a bit on ##bittorrent @freenode, I learned how clients seem to implement it:

Some put the parts into multiple files in one way or another (like above).

But most put the parts directly into the torrent file:

1 - Write parts at the beginning of the torrent file (the full data blob, not the .torrent metainfo file), and sort them as they come:
  • Let's say I have part 3: I put it at the beginning of a new file
  • Then I get part 1, and put it in front of part 3
  • Finally I get part 2, and put it in-between
This temporarily saves space on disk, at the risk of a "not enough space on disk" error if the file gets too big, but it looks quite complex: you need to keep track of where the parts are, and wait until the sorting is finished before reading/writing further parts.


2 - Or they allocate storage for the whole file up front (such as an empty 2GB file) and fill it with the parts as they come, writing them at the correct offset. This way is much simpler: since you have a list of which part goes where, there is no sorting involved: read where the part should go, and you know where to read it back from.
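In code, that second way is pleasantly boring (hypothetical helper; a real one would use fseeko(3) and off_t for files over 2GB):

Code:
#include <stdio.h>

/* hypothetical helper: write one received part at its offset
 * in a preallocated file */
int
writepart(FILE *f, long idx, long partlen, const void *buf, size_t len)
{
    if (fseek(f, idx * partlen, SEEK_SET) != 0)
        return -1;
    return fwrite(buf, 1, len, f) == len ? 0 : -1;
}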

With this latter approach, in the case of multifile torrents, you can either:
  • fill one big file with all the parts, and split it when all the parts are there, or
  • directly split the parts into the final files as they come.

These two approaches (1 and 2) have an advantage: no need to keep the part files (which cost a lot of storage [EDIT] and inodes). On the other hand, if the final file is moved, it cannot be seeded anymore.

If it were me, I would still do one file per part, but you are not me. :)