
Regarding stdlib functions #100

Open · Jolanrensen opened this issue Jul 17, 2021 · 17 comments
@Jolanrensen (Collaborator)

One of the things that makes Kotlin so great to work with, compared to other languages, is its extensive and declarative standard library.
Functions like mapNotNull { } and first { it.a > 4 }. To promote Kotlin for Spark, it might be helpful to bring the standard library closer to Dataset and RDD calculations.

There are multiple ways we could achieve this.
The first way is to simply convert Datasets to Iterables and Sequences:

// Both wrap toLocalIterator(), which streams the partitions to the driver
// one at a time instead of collecting the whole Dataset at once.
inline fun <reified T> Dataset<T>.asSequence(): Sequence<T> = Sequence { toLocalIterator() }
inline fun <reified T> Dataset<T>.asIterable(): Iterable<T> = Iterable { toLocalIterator() }

However, I am not sure how this would affect performance: Spark's own filter, map, etc. run distributed on the executors and are optimized, whereas everything on a Sequence runs on the driver.
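
For example, with asSequence the stdlib stays lazy and only pulls as much data as it needs (User and firstAdult below are made up for illustration):

import org.apache.spark.sql.Dataset

data class User(val name: String, val age: Int)

// Illustration only: find the first matching row without collecting the
// whole Dataset; partitions are streamed to the driver lazily.
fun firstAdult(users: Dataset<User>): User? =
    users.asSequence().firstOrNull { it.age >= 18 }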

The second option would be to copy the standard library functions for Sequences/Iterables and reimplement them as extensions for Datasets and RDDs.

What do you think, @asm0dey?

@asm0dey (Contributor) commented Jul 17, 2021

There is already collectAsList, and I hope that is enough. The thing is, all these operations will happen on the master node, so they will move all the data into memory anyway; that makes them not too different from collectAsList. And you can always call collectAsList().asSequence(), am I right?
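
For example, a sketch (Person and nickname are made up here):

import org.apache.spark.sql.Dataset

data class Person(val name: String, val nickname: String?)

// Collect everything into driver memory first, then let the stdlib take over.
fun nicknames(people: Dataset<Person>): List<String> =
    people.collectAsList().asSequence().mapNotNull { it.nickname }.toList()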

@Jolanrensen (Collaborator, Author)

@asm0dey Yes, that's right. It's already possible, but it might be heavy, since it needs to first collect everything into memory and then run the stdlib functions on the resulting list.

However, if we reimplemented all or most stdlib functions for Datasets separately, using the map and filter functions already present in Spark, we might be able to make them more efficient.
I'll try a few out to see if it works. In the best case we get some extra helpful functions like mapNotNull {} etc.
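
As a rough sketch (not a final implementation), mapNotNull could be built on Spark's own flatMap so nulls are dropped executor-side. The encoder is passed explicitly here, while real extensions could derive it via reified generics; in practice the lambda must also be serializable:

import org.apache.spark.api.java.function.FlatMapFunction
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Encoder

// Sketch only: emit zero or one element per row, dropping nulls on the executors.
fun <T, R> Dataset<T>.mapNotNull(encoder: Encoder<R>, transform: (T) -> R?): Dataset<R> =
    flatMap(FlatMapFunction<T, R> { t -> listOfNotNull(transform(t)).iterator() }, encoder)

// Usage, e.g.: people.mapNotNull(Encoders.STRING()) { it.nickname }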

@asm0dey (Contributor) commented Jul 19, 2021 via email

@Jolanrensen (Collaborator, Author)

@asm0dey Yeah, you're right, it's somewhere in between. However, for some functions I'm not sure how efficient they will be. Take first { it.a > 0 }: if you iterate over the entire Dataset, everything gets loaded into memory on the driver, so for now I implemented it as filter { it.a > 0 }.first().
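
That is, roughly (a sketch using Spark's FilterFunction overload; the predicate must be serializable in practice):

import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.Dataset

// The predicate runs distributed on the executors; only the first
// surviving row is brought back to the driver.
fun <T> Dataset<T>.first(predicate: (T) -> Boolean): T =
    filter(FilterFunction<T> { predicate(it) }).first()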

@asm0dey (Contributor) commented Jul 20, 2021

Yes, in some cases it could really make sense to implement such wrappers!

@Jolanrensen (Collaborator, Author)

@asm0dey What should I do? In many cases I have the choice between a Spark-only route and converting to an Iterable first. In almost all cases the Spark-only route is slower (up to 10×), but the Iterable route takes more driver memory. Which should I pick?

For example:

// Spark-only route: the comparison runs on the executors, but launches a job.
inline operator fun <reified T> Dataset<T>.contains(element: T): Boolean = !filter { it == element }.isEmpty

vs

// Iterable route: streams partitions to the driver until a match is found.
inline operator fun <reified T> Dataset<T>.contains(element: T): Boolean = Iterable<T> { toLocalIterator() }.contains(element)

@asm0dey (Contributor) commented Jul 20, 2021

Both of course :)
There is no single right solution; one should choose wisely, with all the available information. So it could be datasetContains vs iterableContains, I think.

@Jolanrensen (Collaborator, Author)

@asm0dey Alright, but one of them needs to be the default, because in this case it's an operator function :)

@asm0dey (Contributor) commented Jul 21, 2021

I would go with the non-OOMing implementation as the default :)
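
E.g., as a sketch with the names suggested above:

import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.Dataset

// Streaming route: partitions are pulled to the driver one at a time,
// so memory stays bounded, but worst case the whole Dataset is scanned.
fun <T> Dataset<T>.iterableContains(element: T): Boolean =
    Iterable { toLocalIterator() }.contains(element)

// Spark route: the comparison runs on the executors, at the cost of a job.
fun <T> Dataset<T>.datasetContains(element: T): Boolean =
    !filter(FilterFunction<T> { it == element }).isEmpty

// The operator defaults to the implementation that can't OOM the driver.
operator fun <T> Dataset<T>.contains(element: T): Boolean = iterableContains(element)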

@Jolanrensen (Collaborator, Author)

@asm0dey I agree!

Is there an annotation that can give hints to users, aside from @Deprecated?
The standard library does something like this: when you type listOf(1, 2, 3).filter { it > 1 }.first(), the IDE advises replacing it with listOf(1, 2, 3).first { it > 1 } for speed and readability.

@asm0dey (Contributor) commented Jul 21, 2021

ReplaceWith IIRC, but it's barely documented :(
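
E.g., something like this (made-up names on a plain List, just to show the mechanism):

// Deprecating a declaration with replaceWith gives the user an automatic
// quick-fix in the IDE, while WARNING level keeps existing code compiling.
@Deprecated(
    message = "Use firstWhere { } instead.",
    replaceWith = ReplaceWith("firstWhere(predicate)"),
    level = DeprecationLevel.WARNING,
)
fun <T> List<T>.filteredFirst(predicate: (T) -> Boolean): T = firstWhere(predicate)

fun <T> List<T>.firstWhere(predicate: (T) -> Boolean): T = first(predicate)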

@asm0dey (Contributor) commented Jul 21, 2021

But what you're describing is an inspection, not an annotation, isn't it?

@Jolanrensen (Collaborator, Author)

@asm0dey Yes, I think you're right... Inspections need an IntelliJ plugin, don't they?

@asm0dey (Contributor) commented Jul 21, 2021

Hypothetically they could be user-provided, but we can't ship them from inside the library.

@Jolanrensen (Collaborator, Author)

@asm0dey Indeed, I found this: https://www.jetbrains.com/help/idea/creating-custom-inspections.html

But it's also possible to make a small plugin: https://plugins.jetbrains.com/docs/intellij/code-inspections.html
It might also be handy to point users from reduce {} to reduceK {}, etc. I'll see how easy something like that would be.

@Jolanrensen (Collaborator, Author)

[screenshot: a proof-of-concept inspection hint shown in the IDE]
This is actually quite interesting. It's a bit hard due to the lack of documentation, but by using samples such as SimplifiableCallInspection.kt, I did manage to create simple hints for users.

I'll probably finish the stdlib functions themselves first and look at the plugin again afterward, but as a proof of concept it does work :).

@asm0dey (Contributor) commented Jul 26, 2021 via email
